This document explores some options for trained ensembles that we could start using for COVID-19 forecasting. We focus on results for incident cases and deaths only, because a complete set of results for hospitalizations and cumulative deaths is not available. The last forecast date evaluated is 2021-01-18, because complete results are not available for the "full history" method described below for the week of 2021-01-25 (I stopped running estimation after 3 days).
These scores summarize model skill for each combination of base target and spatial scale.
For brevity, we'll look here at performance for a subset of the variations on "trained" approaches that we have considered. Below are the settings we're examining, and the reasons we chose them from among the alternatives.
Within these settings, we explore variations in the training set window size (the number of past weeks of forecasts used to estimate ensemble weights). For state and national level forecasts, we consider window sizes ranging from 3 to 10 weeks, as well as a "full history" method that goes back to the first week with forecasts from at least two component models. For county level forecasts, we restrict to a window size of 3 weeks because larger window sizes are computationally infeasible for generating forecasts in real time.
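As a rough sketch of how a training window might be assembled (the data frame layout and column names here are assumptions for illustration, not the actual estimation pipeline):

```python
from typing import Optional

import pandas as pd


def training_window(scores: pd.DataFrame, forecast_date,
                    window_size: Optional[int]) -> pd.DataFrame:
    """Subset past component-model forecasts to the training window.

    window_size: number of past weeks to keep, or None for "full history",
    which goes back to the first week with forecasts from at least two models.
    (Column names "forecast_date" and "model" are assumed for this sketch.)
    """
    past = scores[scores["forecast_date"] < forecast_date]
    if window_size is not None:
        # keep only the most recent `window_size` forecast weeks
        recent_weeks = sorted(past["forecast_date"].unique())[-window_size:]
        return past[past["forecast_date"].isin(recent_weeks)]
    # full history: start at the first week with forecasts from >= 2 models
    models_per_week = past.groupby("forecast_date")["model"].nunique()
    start = models_per_week[models_per_week >= 2].index.min()
    return past[past["forecast_date"] >= start]
```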
We also consider three quantile grouping strategies: "per model" weights, where each model gets a single weight shared across all quantile levels; "per quantile" weights, where there is a separate weight parameter for each combination of model and quantile level; and "3 groups" of quantile levels, with one weight per model within each of three groups: the three lowest quantile levels, the three highest, and the middle ones.
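One way to view these strategies is as a mapping from quantile levels to weight-parameter groups; a minimal sketch of that mapping (the strategy labels are hypothetical, not identifiers from the estimation code):

```python
def quantile_groups(quantile_levels, strategy):
    """Map each quantile level to a weight-group label under one of the
    three grouping strategies described above."""
    qs = sorted(quantile_levels)
    if strategy == "per_model":
        # a single weight per model, shared across all quantile levels
        return {q: "all" for q in qs}
    if strategy == "per_quantile":
        # a separate weight for each (model, quantile level) combination
        return {q: q for q in qs}
    if strategy == "3_groups":
        # three lowest, three highest, and the middle levels share weights
        return {q: ("lower" if q in qs[:3] else
                    "upper" if q in qs[-3:] else "middle")
                for q in qs}
    raise ValueError(f"unknown strategy: {strategy}")
```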
Finally, we consider a "prospective selection" method that, each week, chooses the ensemble method with the best average WIS over previous weeks. However, in these results prospective selection was not able to choose the "full history" ensemble, because those results were not available at the time the prospective selection method was run.
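A minimal sketch of the selection rule, assuming a long-format table of per-week mean WIS by candidate ensemble method (the column names are assumptions):

```python
import pandas as pd


def prospective_choice(weekly_wis: pd.DataFrame, forecast_date) -> str:
    """Return the candidate method with the lowest average WIS over the
    weeks before `forecast_date`.

    weekly_wis: columns "method", "forecast_date", "mean_wis" (assumed).
    """
    past = weekly_wis[weekly_wis["forecast_date"] < forecast_date]
    return past.groupby("method")["mean_wis"].mean().idxmin()
```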
We compare to two "untrained" ensembles: an equally weighted mean ("ew") of the component forecasts at each quantile level, and a median at each quantile level.
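Concretely, both untrained ensembles combine the component models' predictive quantiles quantile-by-quantile; a sketch under an assumed long-format forecast layout:

```python
import pandas as pd


def untrained_ensemble(forecasts: pd.DataFrame, method: str = "median") -> pd.DataFrame:
    """Combine component forecasts at each quantile level.

    forecasts: columns "model", "location", "target_end_date", "quantile",
    "value" (assumed). method: "ew" for an equally weighted mean,
    "median" for the quantile-wise median.
    """
    agg = "mean" if method == "ew" else "median"
    keys = ["location", "target_end_date", "quantile"]
    return forecasts.groupby(keys, as_index=False)["value"].agg(agg)
```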
We perform estimation either separately for each spatial scale (National, State, and County), or jointly across the State and National levels.
The overall average scores in the tables below are computed across a set of forecasts that is comparable for all models, determined by the model with the fewest available forecasts (the variation with a training set window of 10 weeks). For incident deaths, the relative rankings of the median and the mean ("ew") can change as a few weeks are added to or removed from the evaluation set. The per-week scores plotted further down are computed across a comparable set of forecasts for all models that are available within each week.
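The "comparable set" restriction amounts to keeping only the forecast tasks that are scored for every model; a sketch of that subsetting, with hypothetical task columns:

```python
import pandas as pd


def comparable_scores(scores: pd.DataFrame,
                      task_cols=("location", "forecast_date", "horizon")) -> pd.DataFrame:
    """Keep only the forecast tasks scored for every model, so that average
    scores are computed over the same forecasts for each model."""
    task_cols = list(task_cols)
    n_models = scores["model"].nunique()
    counts = scores.groupby(task_cols)["model"].nunique().reset_index(name="n_models")
    complete = counts[counts["n_models"] == n_models][task_cols]
    return scores.merge(complete, on=task_cols)
```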
National level mean scores across comparable forecasts for all methods:
State level mean scores across comparable forecasts for all methods:
County level mean scores across comparable forecasts for all methods:
National level mean scores across comparable forecasts for all methods:
State level mean scores across comparable forecasts for all methods:
In these plots we show results for the mean, the median, the prospective selection method, and a variation of the "full history" method that had reasonably good performance. For state level results, this is the variation with 3 quantile groups and estimation_grouping == "state"; for national level results, it is the variation with 3 quantile groups and estimation_grouping == "state_national".
This section displays heat maps showing score availability by date, target_variable, spatial scale, and model. In each cell, we expect to see a number of scores equal to the number of locations for the given spatial scale times the number of horizons for the given target.
Here we have subset the forecasts to those that are comparable across all models within each combination of base target and spatial scale. We expect to see the exact same score counts for all models within each plot facet. Average scores computed within a combination of base target and spatial scale will be comparable.
Here we have subset the forecasts to those that are comparable across all models within each combination of base target, spatial scale, and week. We expect to see the exact same score counts within each column of the plot, for all models for which any forecasts are available. Average scores computed within a combination of base target, spatial scale, and forecast week will be comparable.
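The per-week version applies the same idea within each forecast week, requiring completeness only among the models that have any forecasts for that week; again a sketch with assumed column names:

```python
import pandas as pd


def comparable_scores_by_week(scores: pd.DataFrame,
                              task_cols=("location", "horizon")) -> pd.DataFrame:
    """Within each forecast week, keep only the tasks scored by every model
    that has any forecasts available for that week."""
    task_cols = list(task_cols)
    kept = []
    for _, week_scores in scores.groupby("forecast_date"):
        n_models = week_scores["model"].nunique()
        counts = (week_scores.groupby(task_cols)["model"]
                  .nunique().reset_index(name="n_models"))
        complete = counts[counts["n_models"] == n_models][task_cols]
        kept.append(week_scores.merge(complete, on=task_cols))
    return pd.concat(kept, ignore_index=True)
```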